SYNCHRONIZATION AND PIPELINE OVERHEAD MEASUREMENTS 
ON THE FPS-5000 MIMD COMPUTER 


by 


I. J. Curington and R. W. Hockney 
Floating Point Systems UK, and Reading University UK 


16 August 1985 


Submitted to PARALLEL COMPUTING 85 
North-Holland, 1985. 


I. J. Curington

Floating Point Systems 

Apex House 

London Road 

Bracknell, Berkshire RG12 2TE 
England 


R. W. Hockney 

Department of Computer Science 
Reading University 
Whiteknights Park 

Reading, Berkshire RG6 2AX 
England 




ABSTRACT 
A method of characterizing performance of the MIMD architecture of
the FPS-5000 is presented, using the three parameter ($r_\infty$,
$n_{1/2}$, $s_{1/2}$) method, with further extensions to the $f_{1/2}$
parameter. This performance model examines the relationships of
theoretical vector performance, overheads for filling pipelines,
overheads for the start and synchronization of multiple
co-processors, and the effects of memory bandwidth on performance.
The penalty for MIMD computation can be measured by the cost of
synchronization of multiple instruction streams in terms of lost
work. The parameter $s_{1/2}$ is used to measure synchronization,
while the figure $n_{1/2}$ is a measure of the initial delay required
to start a vector operation. The parameter $r_\infty$ is the measure
of asymptotic performance, and the parameter $f_{1/2}$ is the measure
of floating point operations per memory reference to achieve half
the asymptotic performance. Results are presented with favorable
comparisons to similar measurements of other supercomputer
architectures.


INTRODUCTION 

Although the scientific computer industry has used MFLOPS
(millions of floating point operations per second) as a single
performance parameter for years, it has long been shown to be
inadequate for measuring performance on a vector or parallel computer
architecture [1,2,4], because it ignores vector startup overhead.
Although manufacturers still do not publish two or three parameter
descriptions, these figures have been measured on a number of
different supercomputers [4,5,6]. The movement in the computer
industry to exploit parallel architectures for higher performance
has left a gap in our ability to characterize performance by
MFLOPS alone.


A more accurate model for characterizing performance is presented 
here, using two parameters to measure asymptotic performance and 
vector startup costs, and an additional two parameters to describe 
multiple processor synchronization costs and memory bandwidth
efficiency. The list below summarizes the terms used in the
FPS-5000 performance analysis.


SUMMARY OF TERMS: 


$r_\infty$ is the asymptotic performance in MFLOPS, as vector
length goes to infinity.

$n_{1/2}$ is the startup delay for a vector operation,
measured in units of floating point operations lost during the
startup time.

$s_{1/2}$ is used to measure synchronization. It is a measure of
the number of floating-point operations required to achieve 50%
of $r_\infty$ on an MIMD architecture.

$f$ is the number of floating-point operations performed per
main memory reference.

$r_\infty(f)$ is the asymptotic performance in MFLOPS for
a given $f$.

$\hat{r}_\infty$ is the asymptotic performance in MFLOPS as $f$ goes
to infinity.


ARCHITECTURE 

The FPS-5000 MIMD architecture is composed of a central Control 
Processor (CP), several XP32 co-processors, a GPIOP I/O processor, 
and a large System Common Memory (SCM) (see Figure 1). The model
under test was an FPS-5320A; a VAX 11/780 computer system was
used to initiate the test programs and to accumulate the
timing results.


[Figure 1 block diagram. Components shown: VAX 11/780, Control
Processor, GPIOP (timer), two XP32 co-processors each with local
memory, and System Common Memory, linked by control and data paths.]

Figure 1. FPS-5000 MIMD Architecture Diagram.


The Control Processor is an SIMD architecture with a separate 
program storage memory, a floating point adder and a floating 
point multiplier. Both arithmetic elements produce results at 6 
MHz. The Control Processor directly executes floating point 
operations, as well as providing control to the other parts of the 
FPS-5000. 


The XP32 co-processor is also an SIMD architecture with a separate 
program storage memory, two floating point adders and one floating 
point multiplier. All three of these produce results at the
instruction clock rate of 6 MHz. Unlike the CP, the XP32 contains
a high speed local memory, and a separate controller for movement
of data between the local memory and the System Common Memory (SCM).
One FPS-5000 chassis may contain up to three XP32s; however, the tests
presented in this paper used an FPS-5320A configuration, which
contains two XP32s.
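The unit counts and the 6 MHz result rate quoted above imply an upper bound on the arithmetic rate of each processor. The short sketch below works out that bound, assuming every arithmetic unit delivers one result per clock; the function name is illustrative, and the figures are theoretical ceilings, not measured values.

```python
# Peak floating-point rate = number of arithmetic units x result rate.
# Assumption: each unit delivers one result per clock (a theoretical ceiling).
CLOCK_MHZ = 6.0  # result rate per arithmetic unit, as stated in the text


def peak_mflops(n_units: int, clock_mhz: float = CLOCK_MHZ) -> float:
    """Upper bound in MFLOPS if every unit produces one result per clock."""
    return n_units * clock_mhz


cp_peak = peak_mflops(2)    # Control Processor: one adder + one multiplier
xp32_peak = peak_mflops(3)  # XP32: two adders + one multiplier

print(f"CP peak:   {cp_peak:.0f} MFLOPS")
print(f"XP32 peak: {xp32_peak:.0f} MFLOPS")
```

An operator that keeps only some of the units busy, such as a pure vector multiply, cannot reach this ceiling.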


The FPS-5000 system architecture contains an SISD class 
programmable I/O processor, the GPIOP, with a separate instruction 
stream from the other system elements. For purposes of 
performance analysis, the GPIOP is used to count machine cycles, 
recording elapsed time to an accuracy of one microsecond. The 
timing results are collected by the Control Processor, then passed 
to the VAX for final reporting. 


The primary data storage area of the FPS-5000, directly accessible 
by the VAX, Control Processor, and XP32s is the System Common 
Memory. Vector arithmetic operations in the Control Processor, such as
those used in this analysis, operate directly upon this memory.
As the XP32 operates on local data storage, data is moved from the 
SCM to the XP32 and back for processing. 


PARAMETERIZATION 

An SIMD computer uses a single instruction stream to control
multiple arithmetic units, all operating concurrently on distinct
data elements. Both the Control Processor (CP) and the XP32
co-processor in the FPS-5000 have SIMD architectures, in the form
of pipelined floating point arithmetic elements. Such an
architecture can be characterized by the two parameter model of
($r_\infty$, $n_{1/2}$) [4]. In this case, the time $t$ for a vector
operation of length $n$ is approximated by

$$ t = r_\infty^{-1} (n + n_{1/2}) . $$
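As a minimal sketch of the two parameter model, the Python below evaluates the time $t = (n + n_{1/2})/r_\infty$ and the rate actually achieved at several vector lengths. The function names and the parameter values are assumptions chosen for illustration only; they are not measurements from the FPS-5000.

```python
def vector_time_us(n: float, r_inf: float, n_half: float) -> float:
    """Time in microseconds for a vector operation of length n.

    r_inf is in MFLOPS, i.e. results per microsecond, so
    t = (n + n_half) / r_inf comes out in microseconds.
    """
    return (n + n_half) / r_inf


def achieved_mflops(n: float, r_inf: float, n_half: float) -> float:
    """Average rate over the whole operation: r_inf / (1 + n_half / n)."""
    return n / vector_time_us(n, r_inf, n_half)


# Illustrative parameters only (not measured values).
R_INF, N_HALF = 10.0, 50.0

for n in (10, 50, 500, 5000):
    print(n, round(achieved_mflops(n, R_INF, N_HALF), 2))
```

At a vector length equal to $n_{1/2}$ the achieved rate is exactly half of $r_\infty$, which is the origin of the "half-performance length" name.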

In an MIMD computer, such as the FPS-5000, each individual
processor may be characterized by such a two parameter model, but
an additional parameter is needed to describe synchronization and
control overhead in using multiple processors for a particular
operation. The parameter $s_{1/2}$ characterizes the synchronization
overhead much like the vector startup overhead, but for the case
where the vector operation work is shared by multiple processors.
The critical path through an MIMD program can be considered as a
sequence of work segments, between which program synchronization
must occur [6]. Within a work segment, each processor works
independently on a particular task. The synchronization includes
the Control Processor making requests to each of the XP32
co-processors and to itself, then recognizing that all processors
have completed their tasks, that is, have become synchronized again.
The time taken to perform this synchronization with the Control
Processor may be measured independently of the asymptotic
performance of the individual processors, or the combined
asymptotic performance of the complete system. The estimate of
the time $t$ for a work segment consisting of $s$ floating point
operations is

$$ t = r_\infty^{-1} (s + s_{1/2}) . $$
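The work-segment estimate can be sketched in a few lines of Python. The code assumes the form $t = (s + s_{1/2})/r_\infty$ for each segment and sums it along a critical path; the function names, the segment sizes, and the parameter values are hypothetical and serve only to show how the synchronization cost is charged once per segment.

```python
def segment_time_us(s: float, r_inf: float, s_half: float) -> float:
    """Time in microseconds for one work segment of s floating point operations."""
    return (s + s_half) / r_inf


def critical_path_time_us(segments, r_inf: float, s_half: float) -> float:
    """Total time for a sequence of work segments, each followed by one synchronization."""
    return sum(segment_time_us(s, r_inf, s_half) for s in segments)


def efficiency(s: float, s_half: float) -> float:
    """Fraction of r_inf achieved by a segment of s operations."""
    return s / (s + s_half)


# Illustrative parameters only (not measured values).
R_INF, S_HALF = 20.0, 300.0

print(critical_path_time_us([1000, 1000, 1000], R_INF, S_HALF))
print(efficiency(300, S_HALF))   # a segment of s_half operations runs at 50%
print(efficiency(3000, S_HALF))  # larger segments amortize the overhead
```

Splitting the same total work into more, smaller segments increases the number of synchronizations and therefore the time lost.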
These estimates usually depend on the type of vector operation
being performed, and on which architectural elements hold the data.
Over the broad range of algorithms for which a characterization is
desired, an additional parameter is necessary, one which will
characterize the proportion of arithmetic work to data movement in
and out of the main memory. The XP32 co-processor arithmetic
elements operate on data stored in local high speed memory, while
a separate controller moves data (usually in blocks) to or from
the main System Common Memory (SCM). The number of floating point
operations on each data element transferred forms the ratio $f$,
which will directly affect any measurement of $r_\infty$ and the
associated overhead parameter. The value of $f$ for which half of
the peak performance is reached is termed $f_{1/2}$.


SOFTWARE 

The tests were conducted using the simplest of software interfaces 
available on the array processor architecture, with no 
exploitation of low-level languages or non-standard communication 
mechanisms [3]. The tests were written in FORTRAN-77 using the 
CPFORTRAN cross compiler for the Control Processor. Vector 
library calls and XP32 operation calls are made directly from 
FORTRAN, so the $n_{1/2}$ vector startup measurements include
parameter passing and control structures of the FORTRAN
programming environment.


MEASUREMENTS 

Timing measurements were made on the FPS-5320A for various
configurations and two types of vector operators. All timing
results were obtained using the GPIOP I/O Processor, running a
program to measure elapsed clock cycles, with timing accurate to
one microsecond. Two types of arithmetic operations were tested:
vector multiply, and vector scalar multiply and scalar add. The
vector multiply (VMUL and ZVMUL) operators are dyadic, forming the
element-by-element product of two input vectors, as
$C_i = A_i \times B_i$ for $i = 1$ to $n$. The performance of this
operator is limited by memory references, since it requires two reads
and one write for every multiply. In addition, the floating-point adders
are idle during this process, so the asymptotic rate $r_\infty$ is well
below maximum.

The vector scalar multiply and scalar add (VSMSA and ZVSASM)
operator has the same number of floating-point operations as memory
references, and is able to use the multiplier and adder elements
in parallel, performing $D_i = A_i \times B + C$ for $i = 1$ to $n$.
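The two operators and their ratio of arithmetic to memory traffic can be illustrated with a short Python sketch. The functions below are plain reference versions written for this illustration, not the FPS library routines or their calling sequences, and they read B and C in the second operator as scalars, as the operator name and the stated operation count suggest.

```python
def vmul(a, b):
    """Element-by-element product: c[i] = a[i] * b[i]."""
    return [x * y for x, y in zip(a, b)]


def vsmsa(a, b, c):
    """Vector scalar multiply and scalar add: d[i] = a[i] * b + c (b, c scalars)."""
    return [x * b + c for x in a]


def flops_per_memory_ref(flops: int, reads: int, writes: int) -> float:
    """Floating point operations per memory reference, counted per element."""
    return flops / (reads + writes)


# Per-element counts when operands and results live in main memory.
f_vmul = flops_per_memory_ref(flops=1, reads=2, writes=1)   # one multiply
f_vsmsa = flops_per_memory_ref(flops=2, reads=1, writes=1)  # multiply + add

print(vmul([1.0, 2.0], [3.0, 4.0]), f_vmul)
print(vsmsa([1.0, 2.0], 2.0, 1.0), f_vsmsa)
```

The vector multiply performs one operation for three memory references, while the scalar multiply and add performs two operations for two references, which is why the second operator can come closer to the arithmetic peak.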


VMUL/ZVMUL from CPFORTRAN:

    CP Only
    1 - XP32
    2 - XP32s
    2 - XP32s + CP

    1 - CRAY-1 Only
    2 - CPU CRAY X-MP22

Figure 2. Vector Multiply Performance Graph.


The linear relations shown in Figure 2 indicate the vector
multiply performance and synchronization overheads for various
configurations. The CRAY timing has been performed using the same
two parameter model, with the dual CPU synchronization using the
TASK method [7]. The CP compares favorably with the CRAY-1, with a
lower vector startup overhead, while the $s_{1/2}$ for two XP32s is
much lower than that of the CRAY X-MP22, showing the effects of much
tighter hardware and software coupling between the individual CPUs in
the FPS-5000.
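Because the model is a straight line in the amount of work, the parameters can be read from timings by an ordinary least-squares fit: the inverse slope gives $r_\infty$, and the intercept divided by the slope gives $n_{1/2}$ (or $s_{1/2}$ when the horizontal axis is the work per segment). The sketch below shows one way to do this; the function name is illustrative and the timings are synthetic, generated from assumed parameters rather than taken from the measurements reported here.

```python
def fit_two_parameter(lengths, times_us):
    """Least-squares fit of t = t0 + n / r_inf.

    Returns (r_inf, n_half), where r_inf = 1 / slope (MFLOPS when times
    are in microseconds) and n_half = t0 / slope.
    """
    m = len(lengths)
    mean_n = sum(lengths) / m
    mean_t = sum(times_us) / m
    sxx = sum((n - mean_n) ** 2 for n in lengths)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times_us))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_n
    return 1.0 / slope, intercept / slope


# Synthetic timings from assumed parameters r_inf = 10, n_half = 50.
lengths = [100, 200, 400, 800]
times_us = [(n + 50.0) / 10.0 for n in lengths]

r_inf, n_half = fit_two_parameter(lengths, times_us)
print(round(r_inf, 3), round(n_half, 3))
```

With real timings the points scatter around the line, and the quality of the fit indicates how well the linear model describes the configuration.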
Figure 3. Vector Scalar Multiply Scalar Add Performance Graph.

VSMSA/ZVSASM from CPFORTRAN:

    309    39.6

    CP Only
    1 - XP32
    2 - XP32s
    2 - XP32s + CP


The linear relations shown in Figure 3 indicate the vector scalar
multiply and scalar add performance and synchronization overheads
for various configurations. The $s_{1/2}$ measurements for this
operator are higher than for the vector multiply measurements,
because the startup times are equivalent but the asymptotic
performance is higher.
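The reasoning can be made explicit with a one-line derivation. Writing the fixed startup and synchronization time as $t_0$ (a symbol introduced here for illustration), the work-segment model gives:

```latex
t \;=\; t_0 + \frac{s}{r_\infty}
  \;=\; \frac{1}{r_\infty}\,\bigl(s + s_{1/2}\bigr)
\quad\Longrightarrow\quad
s_{1/2} \;=\; r_\infty \, t_0 .
```

For a fixed $t_0$, the half-performance parameter therefore grows in proportion to the asymptotic rate: a faster operator loses more potential work during the same startup interval.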


ZVSASM, XPDMAR on 1 - XP32 from CPFORTRAN: 


Sequential I/O & ZVSASM 125 4.2 
Overlapped I/O & ZVSASM 12.6 awe 


NOTE: Results are deliberately rounded to only three significant 
figures. 
Figure 4. XP32 to System Common Memory Bandwidth Graph.


Figure 4 shows the relationship between memory bandwidth and
asymptotic performance for the ZVSASM operator for one XP32
co-processor. The overlapped I/O case uses the controller on the
XP32 to move data in and out during the ZVSASM operator execution,
using double buffering, while the sequential measurements are for
the case where this is not possible. For the sequential test,
data is moved into the XP32, the operation is performed,
and the data is returned to SCM with no overlap. The number
of I/O calls compared with the number of ZVSASM calls determines
the value of $f$. As the lower curve of Figure 4 shows, the measured
data fits the predicted curve very closely, where

$$ r_\infty(f) = \frac{\hat{r}_\infty}{1 + f_{1/2}/f} . $$
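A minimal sketch of the sequential (non-overlapped) curve follows, assuming the form $r_\infty(f) = \hat{r}_\infty / (1 + f_{1/2}/f)$ that results when transfer time and computation time simply add. The function name and the parameter values are illustrative assumptions, not the measured FPS-5000 figures.

```python
def r_inf_sequential(f: float, r_hat: float, f_half: float) -> float:
    """Asymptotic rate when I/O and computation do not overlap.

    Per transferred element the time is t_io + f * t_flop, so the rate is
    f / (t_io + f * t_flop) = r_hat / (1 + f_half / f),
    with r_hat = 1 / t_flop and f_half = t_io / t_flop.
    """
    return r_hat / (1.0 + f_half / f)


# Illustrative parameters only (not measured values).
R_HAT, F_HALF = 12.0, 4.0

for f in (1, 2, 4, 8, 64):
    print(f, round(r_inf_sequential(f, R_HAT, F_HALF), 2))
```

The rate reaches half of the peak at $f = f_{1/2}$ and approaches the peak only when many operations are performed per element transferred.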
The overlapped I/O and computation test exploited the
architectural features of the XP32 in which the local memory has
sufficient bandwidth to allow both I/O and computation
concurrently. The $f_{1/2}$ parameter is half that of the
non-overlapped case, since the I/O is concurrent with, not added to,
computation time. Also, the form of the graph is different from the
non-overlapped model: it consists of two distinct linear segments,
joined at the point where computation time equals I/O time. The first
part of the graph shows linearly increasing performance as more
computations are performed per data transfer. The $f_{1/2}$ has its
value at the point where this linear ramp equals 50% of the peak value.
When the computation time exceeds I/O time the asymptotic performance
is purely a function of the CPU, as transfer time is hidden.
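The difference between the two cases can be sketched by modelling the time per transferred element as a sum (sequential) or as a maximum (overlapped) of the transfer and computation times. The function names and the two timing constants below are assumptions chosen so the arithmetic is exact; they are not measured XP32 values.

```python
def sequential_mflops(f: float, t_io_us: float, t_flop_us: float) -> float:
    """Rate when transfer and computation times add."""
    return f / (t_io_us + f * t_flop_us)


def overlapped_mflops(f: float, t_io_us: float, t_flop_us: float) -> float:
    """Rate when the longer of transfer and computation hides the shorter."""
    return f / max(t_io_us, f * t_flop_us)


def f_half_sequential(t_io_us: float, t_flop_us: float) -> float:
    return t_io_us / t_flop_us


def f_half_overlapped(t_io_us: float, t_flop_us: float) -> float:
    # The linear ramp f / t_io reaches half of the peak 1 / t_flop here.
    return t_io_us / (2.0 * t_flop_us)


# Illustrative timings only: 0.5 us per element transferred, 0.125 us per operation.
T_IO, T_FLOP = 0.5, 0.125
peak = 1.0 / T_FLOP

for f in (1, 2, 4, 8, 16):
    print(f, sequential_mflops(f, T_IO, T_FLOP), overlapped_mflops(f, T_IO, T_FLOP))
print(f_half_sequential(T_IO, T_FLOP), f_half_overlapped(T_IO, T_FLOP))
```

In this sketch the overlapped rate climbs linearly until computation time equals transfer time and then stays at the peak, and its $f_{1/2}$ is half the sequential value, matching the behaviour described above.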




CONCLUSIONS 

A performance model and measurements have been reported for the
asymptotic performance, vector startup overhead, MIMD
synchronization, and memory bandwidth in the FPS-5000 MIMD array
processor. The XP32 co-processor is approximately three times
faster than the Control Processor on the vector operations
measured, and two XP32s double the performance. The startup and
synchronization costs of using the XP32 through CPFORTRAN calls are
many times higher than those of in-line calls using the Control
Processor. The synchronization mechanisms are more efficient
than those of the dual CPU CRAY X-MP, while maintaining a simple
FORTRAN interface.